Data and code management in research practice

Robert Turner, University of Sheffield RSE Team, September 2021

In this session…

Practical advice on:

  • Data file management
  • File naming
  • “Project” folders
  • Metadata

Acknowledgements

Heavily based on Reproducible Research Data and Project Management in R by Anna Krystalli, naming things by Jenny Bryan and Methods in Research Software Engineering by David Wilby.

About me

Bob Turner

Mix of software engineering and research experience.

RSE Team

RSE

13 RSEs, 35 projects / year worth ~£11m total

Why are we here?

Make life easier

Research is hard, let’s not make it harder.

Reproducibility

PLoS Medicine, 2005

Are most published research findings false?

Covid models

FAIR Principles

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

About you

What operating system(s) do you use?

What programming language(s) do you use?

Research process

Data Management

Data Management Plan

  • Start early. Make an RDM plan before collecting data.
  • Anticipate data products as part of your thesis outputs.
  • Think about what technologies to use.

It’s OK to ask for help

Some years ago, Tom Webb (@tomjwebb) asked for advice on Twitter. Some of the resulting conversation is included in this presentation…

Own your data

Take initiative & responsibility. Think long term.

Spreadsheets?

Do you agree?

Excel

But good for data viewing / entry, sometimes, perhaps…

Databases

Signposting

Have a look at the Carpentries Databases and SQL lesson or SQL for Ecology lesson.

Data formats

  • .csv: comma separated values.
  • .tsv: tab separated values.
  • .txt: no formatting specified.

What file formats do you need to work with?

Ensure data is machine readable

Andrea De Santis, unsplash.com

bad

bad

good

ok

  • could help data entry
  • .csv or .tsv copy would need to be saved.

Basic quality control

Use good null values. Missing values are a fact of life:

  • Usually, the best solution is to leave the cell blank
  • NA or NULL are also good options
  • NEVER use 0. Avoid numbers like -999
  • Don’t make up your own code for missing values
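A minimal Python sketch of why the choice of null value matters. The column name `temp_c` and all readings are made up for illustration. Blank cells and `NA` are read as missing and left out of the average, while a sentinel such as -999 silently enters the calculation.

```python
import csv
import io
from statistics import mean

# Hypothetical data: one reading is left blank and one is recorded as NA.
RAW = "site,temp_c\na,10.0\nb,\nc,NA\nd,14.0\n"

# The same data with -999 used as a home-made code for "missing".
SENTINEL_RAW = "site,temp_c\na,10.0\nb,-999\nc,-999\nd,14.0\n"

MISSING = {"", "NA", "NULL"}  # text values treated as "no data"


def read_column(text, column):
    """Return the column as floats, with None wherever a value is missing."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        None if row[column].strip() in MISSING else float(row[column])
        for row in rows
    ]


def mean_of_present(values):
    """Average only the values that were actually recorded."""
    return mean(v for v in values if v is not None)


temps = read_column(RAW, "temp_c")
clean_mean = mean_of_present(temps)

# Nothing marks -999 as missing, so it is averaged like any other number.
sentinel_mean = mean_of_present(read_column(SENTINEL_RAW, "temp_c"))

print(clean_mean, sentinel_mean)
```

With blanks or `NA`, the missing readings are visible as `None` and can be handled deliberately. With -999, the result is wrong and nothing warns you.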

Data security

Raw data are sacrosanct

Give yourself less rope

Photo by Jon Moore, unsplash.com

  • It’s a good idea to revoke your own write permission to the raw data file. Then you can’t accidentally edit it.
  • It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
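One way to revoke write permission from a script, sketched in Python under the assumption that the raw data is an ordinary file on a local disk. The file name is hypothetical and the demonstration runs on a throwaway temporary file.

```python
import os
import stat
import tempfile

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Remove write permission for owner, group and others; keep other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~WRITE_BITS)


def has_write_bit(path):
    """True if any write bit is still set on the file."""
    return bool(os.stat(path).st_mode & WRITE_BITS)


# Demonstration on a temporary file standing in for a raw data file.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = os.path.join(tmp, "2021-09-01_raw-measurements.csv")  # hypothetical
    with open(raw_file, "w") as handle:
        handle.write("site,temp_c\na,10.0\n")

    before = has_write_bit(raw_file)
    make_read_only(raw_file)
    after = has_write_bit(raw_file)

print(before, after)
```

On Linux or macOS, `chmod a-w` followed by the file name does the same from the shell. This guards against accidents only: the owner can always restore write permission.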

Know your main copies

Photo: Pexels CC0

  • identify the main copy of files
  • keep it safe and accessible
  • consider version control
  • consider centralising

How to avoid catastrophes

Backup: on disk

Backup: in the cloud

  • Dropbox, Google Drive, etc.
  • if installed on your system, they can be accessed programmatically through R
  • some version control

Backup: the Open Science Framework osf.io

  • version controlled
  • easily shareable
  • works with other apps (e.g. Google Drive, GitHub)
  • no command line
  • DOIs
  • pre-registration

Backup: GitHub

  • most solid version control
  • keep everything in one project folder
  • can be problematic with really large files

Good File Naming

Let’s face it…

  • There are going to be files
  • LOTS of files
  • The files will change over time
  • The files will have relationships to each other

It’ll probably get complicated

File organization and naming is a mighty weapon against chaos

  • Make a file’s name and location VERY INFORMATIVE about:
    • what it is,
    • why it exists,
    • how it relates to other things
  • The more things are self-explanatory, the better.

What works, what doesn’t?

NO

myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt

YES

2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt

Question

What makes a good file name?

Three principles for good (file) names

  1. Machine readable
  2. Human readable
  3. Play well with default ordering

Machine readable

  • Regular expression and globbing friendly
    • Avoid spaces, punctuation, accented characters, case sensitivity
  • Easy to compute on
    • Deliberate use of delimiters

Filtering and search through Globbing

In the following command:

ls -lh *Plasmid*

the pattern *Plasmid* is a glob.
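The same kind of filtering can be sketched in Python with the standard library's `fnmatch` module. The file names below are hypothetical, and the pattern is applied to a plain list so that the example does not depend on any real directory.

```python
from fnmatch import fnmatchcase

# Hypothetical file listing, named with dates first and "_" between fields.
files = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A02.csv",
    "2013-06-26_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_D08.csv",
    "2014-02-26_BRAFWTNEGASSAY_FFPEDNA-CRC-REPEAT_H03.csv",
]


def glob_filter(names, pattern):
    """Keep the names that match a shell-style pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


plasmid_files = glob_filter(files, "*Plasmid*")
files_2014 = glob_filter(files, "2014-*")

print(len(plasmid_files), files_2014)
```

For files on disk, `glob.glob` or `pathlib.Path.glob` accept the same style of pattern.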

Excerpt of complete file listing

Example of globbing to filter file listing

Search using macOS Finder

Delimit information with punctuation

Deliberate use of "-" and "_" allows recovery of metadata from the filenames:

  • "_" underscore used to delimit units of metadata I want to access later
  • "-" hyphen used to delimit words so our eyes don’t bleed

Splitting filenames by delimiters

This happens to be R but also possible in the shell, Python, etc.
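A Python version might look like the sketch below. The file name and the meaning assigned to each field (`date`, `assay`, `sample`, `well`) are assumptions made for illustration.

```python
from pathlib import Path

FIELDS = ("date", "assay", "sample", "well")  # assumed meaning of each unit


def parse_filename(filename):
    """Split a file name on "_" and label the resulting units of metadata."""
    stem = Path(filename).stem  # drop the extension
    parts = stem.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields in {filename!r}")
    return dict(zip(FIELDS, parts))


info = parse_filename(
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv"
)

# Hyphens separate words inside a single unit of metadata.
sample_words = info["sample"].split("-")

print(info["date"], info["well"], sample_words)
```

The check on the number of fields is deliberate: a file that does not follow the convention fails loudly instead of producing mislabelled metadata.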

Include important metadata

e.g. I’m saving a number of files of temperature data extracted at different resolutions (res) and for a number of months (month). Including these parameters in the filename allows me to use them to target files to read in.

# Build the file name once so that writing and reading use the same path
filename <- paste0(paste("variable", res, month, sep = "_"), ".csv")
write.csv(df, filename)
df <- read.csv(filename)
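The same idea in Python, as a rough sketch: one function builds every file name from its parameters, and the parameters are later read back from the names to pick out the files of interest. The variable name `temperature` and the resolution labels `1km` and `5km` are hypothetical.

```python
def data_filename(variable, res, month):
    """Build a name such as temperature_5km_03.csv from its parameters."""
    return f"{variable}_{res}_{month:02d}.csv"


resolutions = ["1km", "5km"]  # hypothetical resolution labels
months = [1, 2, 12]

names = [
    data_filename("temperature", res, month)
    for res in resolutions
    for month in months
]

# Target only the 5 km files by reading the parameter back out of each name.
five_km = [name for name in names if name.split("_")[1] == "5km"]

print(five_km)
```

Building names in one place keeps the convention consistent, which is what makes the later filtering reliable.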

If it’s machine readable it’s:

  • Easy to search for files later
  • Easy to filter file lists based on names
  • Easy to extract info from file names, e.g. by splitting

Human readable

  • Comprehensible by people

Example: Which set of file(name)s do you want at 3 a.m. before a deadline?

Embrace the slug

If it’s human readable it’s:

  • easy to figure out what the heck something is, based on its name!

Play well with default ordering

  • Put something numeric first
  • Use the ISO 8601 standard for dates
  • Left pad other numbers with zeros

Examples: Chronological order and Logical order

Chronological order: Order by date / time

Dates

Use the ISO 8601 standard for dates: YYYY-MM-DD
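A short Python illustration of why YYYY-MM-DD plays well with default ordering, using three made-up dates. Sorting the ISO 8601 text gives chronological order, while day-first text does not.

```python
from datetime import date

# Hypothetical dates, deliberately out of order.
dates = [date(2021, 9, 14), date(2020, 12, 1), date(2021, 1, 30)]

# Sort the text representations, as a file browser would sort file names.
day_first = sorted(d.strftime("%d-%m-%Y") for d in dates)
iso = sorted(d.isoformat() for d in dates)

# Reference: sort the dates themselves, then convert to text.
chronological = [d.isoformat() for d in sorted(dates)]

print(day_first)
print(iso)
```

The ISO strings sort correctly because the most significant unit (the year) comes first and every field has a fixed width.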

Logical order: Put something numeric first

Left pad other numbers with zeros

If you don’t left pad, you get this:

10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R

which is just sad :(
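The ordering above, and the fix, can be reproduced with a few lines of Python. The helper `left_pad` is a hypothetical name for this sketch, not part of any library.

```python
import re

unpadded = [
    "1_data-cleaning.R",
    "2_fit-model.R",
    "10_final-figs-for-publication.R",
]


def left_pad(name, width=2):
    """Zero-pad the number at the start of a file name: 1_x.R becomes 01_x.R."""
    return re.sub(r"^\d+", lambda match: match.group().zfill(width), name)


# Plain text sorting compares character by character, so "10_" precedes "1_".
default_order = sorted(unpadded)

# After padding, text order and numeric order agree.
padded_order = sorted(left_pad(name) for name in unpadded)

print(default_order)
print(padded_order)
```

Choose the width for the largest number you expect: two digits cover up to 99 files, three digits up to 999.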

Recap: Play well with default ordering

  • Put something numeric first
  • Use the ISO 8601 standard for dates
  • Left pad other numbers with zeros

Recap: Three principles for (file) names

  1. Machine readable
  2. Human readable
  3. Play well with default ordering

Go forth and use awesome file names :)

“Projects”

Where shall I put my data?

File systems

  • Linux / macOS - home folder
  • Windows - documents folder

A project folder

myproject/
|
├── 01_data/
|   ├── 01_raw/
|   ├── 02_working/
|   └── 03_clean/
|
├── 02_scripts/
|
├── 03_figures/
|
├── 04_paper/
|
├── 05_presentation/
|
├── readme.md
|
└── license.md
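A skeleton like this can be created by a small script, so that every new project starts from the same layout. The Python sketch below assumes the folder names shown above and runs in a temporary directory, so it does not touch any real project.

```python
import tempfile
from pathlib import Path

FOLDERS = [
    "01_data/01_raw",
    "01_data/02_working",
    "01_data/03_clean",
    "02_scripts",
    "03_figures",
    "04_paper",
    "05_presentation",
]
FILES = ["readme.md", "license.md"]


def create_project(root):
    """Create the folder skeleton under root without overwriting anything."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)  # leaves existing content untouched
    return root


# Demonstration in a temporary directory; pass a real path for a real project.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "myproject")
    created = sorted(
        path.relative_to(project).as_posix() for path in project.rglob("*")
    )

print(created)
```

Because both `mkdir` and `touch` are called with `exist_ok=True`, running the script again on an existing project is harmless.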

R (rrtools)

analysis/
|
├── paper/
│   ├── paper.Rmd       # this is the main document to edit
│   └── references.bib  # this contains the reference list information
│
├── figures/            # location of the figures produced by the Rmd
|
├── data/
│   ├── raw_data/       # data obtained from elsewhere
│   └── derived_data/   # data generated during the analysis
|
└── templates
    ├── journal-of-archaeological-science.csl
    |                   # this sets the style of citations & reference list
    ├── template.docx   # used to style the output of the paper.Rmd
    └── template.Rmd

Dependency Management

Good to include:

  • Python: requirements.txt, environment.yml, etc.
  • MATLAB: .prj file (XML); a question for MathWorks
  • R: renv.lock, created with the renv package

Don’t write your own dependency management.

Follow conventions…

  • …of your programming language.
  • …of your research area.

Signposting

Metadata

What is metadata?

“Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data).”

Can you think of any examples of metadata?

Types of metadata

  • Descriptive: enables identification, location and retrieval of data; often includes use of controlled vocabularies for classification and indexing.
  • Technical: describes the technical processes used to produce, or required to use, a digital data object.
  • Administrative: used to manage administrative aspects of the digital object, e.g. intellectual property rights and acquisition.

Elements of metadata

  • Structured data files: readable by machines and humans, accessible through the web
  • Controlled vocabularies: allow for connectivity of data (Dublin Core, NERC, re3data.org)

KEY TO SEARCH FUNCTION

  • By structuring & adhering to controlled vocabularies, data can be combined, accessed and searched!
  • Different communities develop different standards which define both the structure and content of metadata

Storing metadata

Anything is better than nothing!

  • readme.md - not machine readable
  • json, yml, xml - can potentially be human and machine readable
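As an illustration of metadata that is both human and machine readable, the Python sketch below writes a small record as JSON and reads it back. The top-level field names loosely follow Dublin Core element names; the `columns` entry and all values are placeholders invented for this example.

```python
import json

# Illustrative record: placeholder values, not a real dataset.
metadata = {
    "title": "Example temperature readings",
    "creator": "A. Researcher",
    "date": "2021-09-01",
    "format": "text/csv",
    "description": "Placeholder description of how the data were collected.",
    "rights": "CC-BY-4.0",
    # Not a Dublin Core element: a simple data dictionary for the columns.
    "columns": {
        "site": "site identifier",
        "temp_c": "air temperature in degrees Celsius; blank means missing",
    },
}

# Indented, sorted output is easy for people to read and to compare over time.
text = json.dumps(metadata, indent=2, sort_keys=True)

# Any JSON parser recovers exactly the same structure.
restored = json.loads(text)

print(text)
```

In practice the text would be saved next to the data file it describes, under a name that makes the pairing obvious.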

Signposting

More resources: The Turing Way

A lightly opinionated guide to reproducible data science https://the-turing-way.netlify.com

https://github.com/alan-turing-institute/the-turing-way

Summary

  • Have a Research Data Management Plan.
  • Your data is a valuable research output!
  • Human and machine readability.
  • Help your future self, and others.